3 Model Architecture
encoder-decoder structure
the encoder maps an input sequence of symbol representations (x1,...,xn) to a sequence of continuous representations z = (z1,...,zn).
In other words: the encoder maps the input sequence x = (x1,...,xn) of symbol representations to a sequence z = (z1,...,zn) of continuous-valued representations.
Given z, the decoder then generates an output sequence (y1, ..., ym) of symbols one element at a time
In other words: given z, the decoder generates the output sequence y = (y1, ..., ym) of symbols one element at a time.
The Transformer follows this overall architecture using stacked self-attention and point-wise, fully connected layers for both the encoder and decoder
Figure 1: the encoder is on the left, the decoder on the right
Nx: both the encoder and the decoder are stacked N times
The decoder appears to take its own previous outputs as input (shifted right)
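To make the "one element at a time" and "shifted right" notes concrete, here is a minimal sketch of an autoregressive decoding loop. Everything in it is an assumption for illustration: `toy_step` is a hypothetical stand-in for the real decoder, `BOS`/`EOS`/`VOCAB` are made-up constants, and picking the argmax at each step (greedy decoding) is just the simplest choice, not necessarily what the paper does.

```python
import numpy as np

BOS, EOS = 0, 1   # hypothetical start and end symbol ids
VOCAB = 8         # hypothetical vocabulary size

def toy_step(z, ys):
    """Stand-in for the decoder: returns scores over the vocabulary.

    A real decoder would attend over z and the previous outputs ys.
    This toy just counts upward and emits EOS once len(ys) reaches len(z).
    """
    logits = np.zeros(VOCAB)
    target = len(ys) + 1 if len(ys) < len(z) else EOS
    logits[target] = 1.0
    return logits

def greedy_decode(z, step_fn, max_len=10):
    """Generate y one symbol at a time, feeding previous outputs back in."""
    ys = [BOS]                      # the "shifted right" input starts with BOS
    for _ in range(max_len):
        logits = step_fn(z, ys)     # decoder sees z and everything generated so far
        nxt = int(np.argmax(logits))
        ys.append(nxt)
        if nxt == EOS:
            break
    return ys[1:]                   # drop the BOS marker

z = np.zeros((3, 4))                # pretend encoder output: n = 3 vectors of size 4
print(greedy_decode(z, toy_step))
```

The point is only the data flow: the encoder output z is computed once, while the decoder is called repeatedly with a growing list of its own earlier outputs.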
Explanation of Figure 1
Explanation of Figure 2 (scaled dot-product attention and multi-head attention)
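As a reminder of what the left half of Figure 2 computes, here is a rough NumPy sketch of scaled dot-product attention, softmax(QK^T / √d_k)V. It assumes a single head, no masking, and 2-D arrays; the function names and toy shapes are my own choices, not taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v) -> output (n_q, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n_q, n_k) compatibility scores
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))   # toy sizes: 3 queries, d_k = 4
K = rng.normal(size=(5, 4))   # 5 keys
V = rng.normal(size=(5, 6))   # 5 values, d_v = 6
out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape, w.shape)
```

Multi-head attention would run several of these in parallel on linearly projected Q, K, V and concatenate the results; that part is left out here.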
3.3 Position-wise Feed-Forward Networks
This is the "Feed Forward" block in Figure 1 (present in both the encoder and the decoder)
each of the layers in our encoder and decoder contains a fully connected feed-forward network, which is applied to each position separately and identically.
This consists of two linear transformations with a ReLU activation in between.
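The two linear transformations are xW1 + b1 and hW2 + b2, where h is the ReLU output of the first; in full, FFN(x) = max(0, xW1 + b1)W2 + b2
max(0, xW1 + b1) is the ReLU applied to the first linear transformation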
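A small NumPy sketch of the feed-forward block, to check my reading of "applied to each position separately and identically". The dimensions below are toy values picked for illustration, and the weights are random rather than learned.

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied row by row."""
    hidden = np.maximum(0.0, x @ W1 + b1)   # first linear transformation + ReLU
    return hidden @ W2 + b2                 # second linear transformation

d_model, d_ff, n = 8, 32, 5                 # toy sizes (assumed for this example)
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

x = rng.normal(size=(n, d_model))           # one vector per position
whole = position_wise_ffn(x, W1, b1, W2, b2)
per_position = np.stack([position_wise_ffn(x[i], W1, b1, W2, b2) for i in range(n)])
print(np.allclose(whole, per_position))
```

Running the block on the whole sequence gives the same result as running it on each position on its own, because no information is mixed across positions here; the same W1, b1, W2, b2 are reused at every position.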
3.4 Embeddings and Softmax
we use learned embeddings to convert the input tokens and output tokens to vectors of dimension d_model.
We also use the usual learned linear transformation and softmax function to convert the decoder output to predicted next-token probabilities.
In our model, we share the same weight matrix between the two embedding layers and the pre-softmax linear transformation, similar to [30].
TODO: how can the embeddings and the linear transformation share weights? (I do not fully understand the output embedding yet)
In the embedding layers, we multiply those weights by √d_model
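One way to picture the weight sharing, as a hedged sketch: a single matrix E of shape (vocab_size, d_model) is used as a lookup table for embeddings and, transposed, as the pre-softmax linear transformation. The names `E`, `embed`, and `next_token_probs` and the toy sizes are my own, the weights are random rather than learned, and the sketch assumes one vocabulary for input and output tokens.

```python
import numpy as np

vocab_size, d_model = 10, 4                   # toy sizes (assumed)
rng = np.random.default_rng(0)
E = rng.normal(size=(vocab_size, d_model))    # the one shared weight matrix

def embed(token_ids):
    """Token ids -> vectors of dimension d_model (row lookup in E, then scaling)."""
    # Scaled by the square root of the model dimension, as in the paper.
    return E[np.asarray(token_ids)] * np.sqrt(d_model)

def next_token_probs(decoder_output):
    """Decoder output (n, d_model) -> next-token probabilities (n, vocab_size)."""
    logits = decoder_output @ E.T             # same E, used as the linear layer
    logits = logits - logits.max(axis=-1, keepdims=True)
    p = np.exp(logits)
    return p / p.sum(axis=-1, keepdims=True)

vectors = embed([1, 2, 3])                    # shape (3, d_model)
probs = next_token_probs(vectors)             # shape (3, vocab_size)
print(vectors.shape, probs.shape)
```

Seen this way, sharing works because both operations need a (vocab_size, d_model) matrix: embedding reads a row per token, and the output layer scores a decoder vector against every row.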
Table 1: the per-layer computational cost of self-attention is quadratic in n (the sequence length), O(n^2 · d)
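A back-of-the-envelope check of the Table 1 orders of growth, using O(n^2 · d) per layer for self-attention and O(n · d^2) for a recurrent layer. Constants are ignored and the function names are made up for this note, so the numbers only show scaling, not real cost.

```python
def self_attention_ops(n, d):
    """Order-of-growth estimate for one self-attention layer: n^2 * d."""
    return n * n * d

def recurrent_ops(n, d):
    """Order-of-growth estimate for one recurrent layer: n * d^2."""
    return n * d * d

d = 512
for n in (64, 128, 256, 512, 1024):
    print(n, self_attention_ops(n, d), recurrent_ops(n, d))
```

Doubling n quadruples the self-attention estimate but only doubles the recurrent one; the two estimates are equal when n equals d, and self-attention is the smaller one while n is below d.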